Abstract

Solomonoff unified Occam's razor and Epicurus' principle of multiple
explanations to one elegant, formal, universal theory of inductive inference,
which initiated the field of algorithmic information theory. His central result
is that the posterior of his universal semimeasure $M$ converges rapidly to the
true sequence generating posterior $\mu$, if the latter is computable. Hence,
$M$ is eligible as a universal predictor in case of unknown $\mu$. We
investigate the existence and convergence of computable universal
(semi)measures for a hierarchy of computability classes: finitely computable,
estimable, enumerable, and approximable. For instance, $M$ is known to be
enumerable, but not finitely computable, and to dominate all enumerable
semimeasures. We define seven classes of (semi)measures based on these four
computability concepts. Each class may or may not contain a (semi)measure which
dominates all elements of another class. The analysis of these 49 cases can be
reduced to four basic cases, two of them being new. The results hold for
discrete and continuous semimeasures. We also investigate more closely the
types of convergence, possibly implied by universality: in difference and in
ratio, with probability 1, in mean sum, and for Martin-Löf random sequences. We
introduce a generalized concept of randomness for individual sequences and use
it to exhibit difficulties regarding these issues.

Keywords

Sequence prediction; Algorithmic Information Theory; Solomonoff's prior;
universal probability; mixture distributions; posterior convergence;
computability concepts; Martin-Löf randomness.
1 Introduction

All induction problems can be phrased as sequence prediction tasks. This is, for
instance, obvious for time series prediction, but also includes classification
tasks. Having observed data $x_1 \ldots x_{t-1}$ at times before $t$, the task
is to predict the $t$-th symbol $x_t$. The key concept to attack general
induction problems is Occam's razor and, to a lesser extent, Epicurus' principle
of multiple explanations. The former/latter may be interpreted as to keep the
simplest/all theories consistent with the observations $x_1 \ldots x_{t-1}$ and
to use these theories to predict $x_t$. Solomonoff [Sol64, Sol78] formalized and
combined both principles in his universal prior $M(x)$ which assigns high/low
probability to simple/complex environments, hence implementing Occam and
Epicurus. Solomonoff's [Sol78] central result is that if the probability
$\mu(x_t|x_1 \ldots x_{t-1})$ of observing $x_t$ at time $t$, given past
observations $x_1 \ldots x_{t-1}$, is a computable function, then the universal
posterior $M(x_t|x_1 \ldots x_{t-1})$ converges rapidly for $t\to\infty$ to the
true posterior $\mu(x_t|x_1 \ldots x_{t-1})$, hence $M$ represents a universal
predictor in case of unknown $\mu$.

One representation of $M$ is as a weighted sum of all enumerable "defective"
probability measures, called semimeasures (see Definition 2). The (from this
representation obvious) dominance $M(x) \geq \mathrm{const.}\times\mu(x)$ for
all computable $\mu$ is the central ingredient in the convergence proof. What is
so special about the class of all enumerable semimeasures
$\mathcal{M}^{semi}_{enum}$? The larger we choose $\mathcal{M}$ the less
restrictive is the essential assumption that $\mathcal{M}$ should contain the
true distribution $\mu$. Why not restrict to the still rather general class of
estimable or finitely computable (semi)measures? For every countable class
$\mathcal{M}$ and $\xi_\mathcal{M}(x) := \sum_{\nu\in\mathcal{M}} w_\nu \nu(x)$
with $w_\nu > 0$, the important dominance
$\xi_\mathcal{M}(x) \geq w_\nu \nu(x)$ for all $\nu\in\mathcal{M}$ is satisfied.
The question is what properties $\xi_\mathcal{M}$ possesses. The distinguishing
property of $M \stackrel{\times}{=} \xi_{\mathcal{M}^{semi}_{enum}}$ is that it
is itself an element of $\mathcal{M}^{semi}_{enum}$. On the other hand, for
prediction $\xi_\mathcal{M}\in\mathcal{M}$ is not by itself an important
property. What matters is whether $\xi_\mathcal{M}$ is computable (in one of the
senses defined) to avoid getting into the (un)realm of non-constructive math.

The intention of this work is to investigate the existence, computability and
convergence of universal (semi)measures for various computability classes:
finitely computable $\subset$ estimable $\subset$ enumerable $\subset$
approximable (see Definition 1). For instance, $M(x)$ is enumerable, but not
finitely computable. The research in this work was motivated by recent
generalizations of Kolmogorov complexity and Solomonoff's prior by Schmidhuber
[Sch02] to approximable (and other cases not discussed here).

Contents. In Section 2 we review various computability concepts and discuss
their relation. In Section 3 we define the prefix Kolmogorov complexity $K$, the
concept of (semi)measures, Solomonoff's universal prior $M$, and explain its
universality. Section 4 summarizes Solomonoff's major convergence result,
discusses general mixture distributions and the important universality
property - multiplicative dominance. In Section 5 we define seven classes of
(semi)measures based on four computability concepts. Each class may or may not
contain a (semi)measure which dominates all elements of another class. We reduce
the analysis of these 49 cases to four basic cases. Domination (essentially by
$M$) is known to be true for two cases. The two new cases do not allow for
domination. In Section 6 we investigate more closely the type of convergence
implied by universality. We summarize the result on posterior convergence in
difference ($\xi-\mu\to 0$) and improve the previous result [LV97] on the
convergence in ratio $\xi/\mu\to 1$ by showing rapid convergence without use of
Martingales. In Section 7 we investigate whether convergence for all Martin-Löf
random sequences could hold. We define a generalized concept of randomness for
individual sequences and use it to show that proofs based on universality cannot
decide this question. Section 8 concludes the paper. Proofs will be presented
elsewhere.

Notation. We denote strings of length $n$ over finite alphabet $\mathcal{X}$ by
$x = x_1x_2\ldots x_n$ with $x_t\in\mathcal{X}$ and further abbreviate
$x_{1:n} := x_1x_2\ldots x_{n-1}x_n$ and $x_{<n} := x_1\ldots x_{n-1}$,
$\epsilon$ for the empty string, $\ell(x)$ for the length of string $x$, and
$\omega = x_{1:\infty}$ for infinite sequences. We abbreviate
$\lim_{n\to\infty}[f(n)-g(n)] = 0$ by $f(n) \stackrel{n\to\infty}{\longrightarrow} g(n)$
and say $f$ converges to $g$, without implying that $\lim_{n\to\infty} g(n)$
itself exists. We write $f(x) \stackrel{\times}{\geq} g(x)$ for
$g(x) = O(f(x))$.

2 Computability Concepts

We define several computability concepts weaker than can be captured by halting
Turing machines.

Definition 1 (Computable functions) We consider functions
$f: \mathbb{N}\to\mathbb{R}$:

f is finitely computable or recursive iff there are Turing machines $T_{1/2}$
with output interpreted as natural numbers and $f(x) = T_1(x)/T_2(x)$,
f is approximable iff there is a finitely computable $\phi(\cdot,\cdot)$ with
$\lim_{t\to\infty}\phi(x,t) = f(x)$,
f is lower semi-computable or enumerable iff additionally
$\phi(x,t) \leq \phi(x,t+1)$,
f is upper semi-computable or co-enumerable iff $[-f]$ is lower semi-computable,
f is semi-computable iff f is lower or upper semi-computable,
f is estimable iff f is lower and upper semi-computable.

If $f$ is estimable we can finitely compute an $\varepsilon$-approximation of
$f$ by upper and lower semi-computing $f$ and terminating when the two bounds
differ by less than $\varepsilon$. This means that there is a Turing machine
which, given $x$ and $\varepsilon$, finitely computes $y$ such that
$|y-f(x)| < \varepsilon$. Moreover it gives an interval estimate
$f(x)\in[y-\varepsilon, y+\varepsilon]$. An estimable integer-valued function is
finitely computable (take any $\varepsilon<1$). Note that if $f$ is only
approximable or semi-computable we can still come arbitrarily close to $f(x)$
but we cannot devise a terminating algorithm which produces an
$\varepsilon$-approximation. In the case of lower/upper semi-computability we
can at least finitely compute lower/upper bounds to $f(x)$. In case of
approximability, the weakest computability form, even this capability is lost.
In analogy to lower/upper semi-computability one may think of notions like
lower/upper estimability but they are easily shown to coincide with
estimability. The following implications are valid:



    recursive = finitely computable
      ⇒ estimable
      ⇒ enumerable = lower semi-computable
         (respectively co-enumerable = upper semi-computable)
      ⇒ semi-computable
      ⇒ approximable



In the following we use the term computable synonymously with finitely
computable, but sometimes also generically for some of the computability forms
of Definition 1. What we call estimable is often just called computable, but it
makes sense to separate the concepts of finite computability and estimability in
this work, since the former is conceptually easier and some previous results
have only been proved for this case.
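The terminating $\varepsilon$-approximation procedure for estimable functions described in this section can be sketched as follows. This is only an illustration: the function names `estimate`, `lower` and `upper`, and the toy target $f(x) = 1/(x+1)$, are assumptions made here and do not appear in the original.

```python
from fractions import Fraction
from typing import Callable

Approx = Callable[[int, int], Fraction]

def estimate(lower: Approx, upper: Approx, x: int, eps: Fraction) -> Fraction:
    """Return y with |y - f(x)| < eps.

    Assumes lower(x, t) increases and upper(x, t) decreases towards f(x)
    as t grows, i.e. f is both lower and upper semi-computable (estimable).
    """
    t = 0
    while True:
        lo, hi = lower(x, t), upper(x, t)
        if hi - lo < eps:          # the bounds enclose f(x) tightly enough
            return (lo + hi) / 2   # midpoint is within eps/2 of f(x)
        t += 1

# Hypothetical toy function f(x) = 1/(x+1) with explicit monotone bounds.
def lower(x: int, t: int) -> Fraction:
    return (1 - Fraction(1, 2 ** t)) / (x + 1)

def upper(x: int, t: int) -> Fraction:
    return (1 + Fraction(2, 2 ** t)) / (x + 1)

y = estimate(lower, upper, 3, Fraction(1, 1000))
print(float(y))
```

If only `lower` were available (the merely enumerable case), the loop would have no stopping criterion, which is the point made in the text about semi-computable functions.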



3 The Universal Prior M 

The prefix Kolmogorov complexity $K(x)$ is defined as the length of the shortest
binary program $p\in\{0,1\}^*$ for which a universal prefix Turing machine $U$
(with binary program tape and $\mathcal{X}$-ary output tape) outputs string
$x\in\mathcal{X}^*$, and similarly $K(x|y)$ in case of side information $y$
[LV97]:

$K(x) = \min\{\ell(p) : U(p) = x\}$,   $K(x|y) = \min\{\ell(p) : U(p,y) = x\}$

Solomonoff [Sol64, Sol78] (with a flaw fixed by Levin [ZL70]) defined (earlier)
the closely related quantity, the universal prior $M(x)$. It is defined as the
probability that the output of a universal Turing machine starts with $x$ when
provided with fair coin flips on the input tape. Formally, $M$ can be defined as

$M(x) := \sum_{p : U(p)=x*} 2^{-\ell(p)}$   (1)

where the sum is over all so-called minimal programs $p$ for which $U$ outputs a
string starting with $x$ (indicated by the $*$). Before we can discuss the
stochastic properties of $M$ we need the concept of (semi)measures for strings.

Definition 2 (Continuous (Semi)measures) $\mu(x)$ denotes the probability that a
sequence starts with string $x$. We call $\mu\geq 0$ a (continuous) semimeasure
if $\mu(\epsilon)\leq 1$ and $\mu(x)\geq\mu(x0)+\mu(x1)$, and a (probability)
measure if equality holds.
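The two conditions of Definition 2 can be checked mechanically on all binary strings up to a fixed depth. The sketch below is illustrative only: the helper names, the finite depth cut-off and the "defective" example $0.4^{\ell(x)}$ are assumptions introduced here, and a finite check can of course refute but never prove the property for all strings.

```python
from itertools import product

def is_semimeasure(mu, depth, tol=1e-12):
    """Check mu(empty) <= 1 and mu(x) >= mu(x0) + mu(x1) for len(x) < depth."""
    if mu("") > 1 + tol:
        return False
    for n in range(depth):
        for bits in product("01", repeat=n):
            x = "".join(bits)
            if mu(x) < -tol or mu(x) + tol < mu(x + "0") + mu(x + "1"):
                return False
    return True

def is_measure(mu, depth, tol=1e-12):
    """Check the same conditions with equality (a probability measure)."""
    if abs(mu("") - 1) > tol:
        return False
    for n in range(depth):
        for bits in product("01", repeat=n):
            x = "".join(bits)
            if abs(mu(x) - mu(x + "0") - mu(x + "1")) > tol:
                return False
    return True

def bernoulli(theta):
    """Bernoulli(theta) measure on binary strings."""
    return lambda x: theta ** x.count("1") * (1 - theta) ** x.count("0")

def defective(x):
    """Toy semimeasure losing probability mass at every step."""
    return 0.4 ** len(x)

print(is_measure(bernoulli(0.3), 6), is_semimeasure(defective, 6),
      is_measure(defective, 6))
```

The `defective` example mimics, in a crude way, the mass that $M$ loses to programs which output $x$ and then never print another symbol.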
We have $M(x0)+M(x1) < M(x)$ because there are programs $p$, which output $x$,
neither followed by 0 nor 1. They just stop after printing $x$ or continue
forever without any further output. Together with $M(\epsilon) = 1$ this shows
that $M$ is a semimeasure, but not a probability measure. We can now state the
fundamental property of $M$ [Sol78]:

Theorem 3 (Universality of M) The universal prior $M$ is an enumerable
semimeasure which multiplicatively dominates all enumerable semimeasures in the
sense that $M(x) \stackrel{\times}{\geq} 2^{-K(\rho)}\cdot\rho(x)$ for all
enumerable semimeasures $\rho$. $M$ is enumerable, but not estimable or finitely
computable.

The Kolmogorov complexity of a function like $\rho$ is defined as the length of
the shortest self-delimiting code of a Turing machine computing this function in
the sense of Definition 1. Up to a multiplicative constant, $M$ assigns higher
probability to all $x$ than any other computable probability distribution.

It is possible to normalize $M$ to a true probability measure $M_{norm}$
[Sol78, LV97] with dominance still being true, but at the expense of giving up
enumerability ($M_{norm}$ is still approximable). $M$ is more convenient when
studying algorithmic questions, but a true probability measure like $M_{norm}$
is more convenient when studying stochastic questions.

4 Universal Sequence Prediction 

In which sense does $M$ incorporate Occam's razor and Epicurus' principle of
multiple explanations? Since the shortest programs $p$ dominate the sum in $M$,
$M(x)$ is roughly equal to $2^{-K(x)}$ ($M(x) = 2^{-K(x)+O(K(\ell(x)))}$), i.e.
$M$ assigns high probability to simple strings. More useful is to think of $x$
as being the observed history. We see from (1) that every program $p$ consistent
with history $x$ is allowed to contribute to $M$ (Epicurus). On the other hand
shorter programs give significantly larger contribution (Occam). How does all
this affect prediction? If $M(x)$ describes our (subjective) prior belief in
$x$, then $M(y|x) := M(xy)/M(x)$ must be our posterior belief in $y$. From the
symmetry of algorithmic information $K(xy) \approx K(y|x) + K(x)$, and
$M(x) \approx 2^{-K(x)}$ and $M(xy) \approx 2^{-K(xy)}$ we get
$M(y|x) \approx 2^{-K(y|x)}$. This tells us that $M$ predicts $y$ with high
probability iff $y$ has an easy explanation, given $x$ (Occam & Epicurus).

The above qualitative discussion should not create the impression that $M(x)$
and $2^{-K(x)}$ always lead to predictors of comparable quality. Indeed in the
online/incremental setting, $K(y) = O(1)$ invalidates the consideration above.
The proof of (2) below, for instance, depends on $M$ being a semimeasure and the
chain rule being exactly true, neither of which is satisfied by $2^{-K(x)}$. See
[Hut03] for a more detailed analysis.

Sequence prediction algorithms try to predict the continuation
$x_t\in\mathcal{X}$ of a given sequence $x_1\ldots x_{t-1}$. We assume that the
true sequence is drawn from a computable probability distribution $\mu$, i.e.
the true (objective) probability of $x_{1:t}$ is $\mu(x_{1:t})$. The probability
of $x_t$ given $x_{<t}$ hence is $\mu(x_t|x_{<t}) = \mu(x_{1:t})/\mu(x_{<t})$.
Solomonoff's [Sol78] central result is that $M$ converges to $\mu$. More
precisely, for binary alphabet, he showed that

$\sum_{t=1}^\infty \sum_{x_{<t}\in\{0,1\}^{t-1}} \mu(x_{<t}) \big(M(0|x_{<t}) - \mu(0|x_{<t})\big)^2 \leq \frac{1}{2}\ln 2\cdot K(\mu) + O(1) < \infty.$   (2)

The infinite sum can only be finite if the difference
$M(0|x_{<t}) - \mu(0|x_{<t})$ tends to zero for $t\to\infty$ with $\mu$
probability 1 (see Definition 9 and [Hut01] or Section 6 for general alphabet).
This holds for any computable probability distribution $\mu$. The reason for the
astonishing property of a single (universal) function to converge to any
computable probability distribution lies in the fact that the sets of
$\mu$-random sequences differ for different $\mu$. Past data $x_{<t}$ are
exploited to get a (with $t\to\infty$) improving estimate $M(x_t|x_{<t})$ of
$\mu(x_t|x_{<t})$.

The universality property (Theorem 3) is the central ingredient in the proof of
(2). The proof involves the construction of a semimeasure $\xi$ whose dominance
is obvious. The hard part is to show its enumerability and equivalence to $M$.
Let $\mathcal{M}$ be the (countable) set of all enumerable semimeasures and
define

$\xi(x) := \sum_{\nu\in\mathcal{M}} 2^{-K(\nu)}\nu(x).$   (3)

Then dominance

$\xi(x) \geq 2^{-K(\nu)}\nu(x) \quad \forall\nu\in\mathcal{M}$   (4)

is obvious. Is $\xi$ lower semi-computable? To answer this question one has to
be more precise. Levin [ZL70] has shown that the set of all lower
semi-computable semimeasures is enumerable (with repetitions). For this (ordered
multi) set $\mathcal{M} = \mathcal{M}^{semi}_{enum} := \{\nu_1,\nu_2,\nu_3,\ldots\}$
and $K(\nu_i) := K(i)$ one can easily see that $\xi$ is lower semi-computable.
Finally proving $M(x) \stackrel{\times}{=} \xi(x)$ also establishes universality
of $M$ (see [Sol78, LV97] for details).

The advantage of $\xi$ over $M$ is that it immediately generalizes to arbitrary
weighted sums of (semi)measures for arbitrary countable $\mathcal{M}$.
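For a finite class the weighted sum and its dominance property can be computed directly. The following sketch uses a class of three Bernoulli measures with hand-picked weights; the class, the weights and all function names are illustrative assumptions, not part of the original construction (which sums over all enumerable semimeasures with weights $2^{-K(\nu)}$).

```python
from itertools import product

def bernoulli(theta):
    """Bernoulli(theta) measure on binary strings."""
    return lambda x: theta ** x.count("1") * (1 - theta) ** x.count("0")

def mixture(measures, weights):
    """Weighted sum xi(x) = sum_nu w_nu * nu(x)."""
    return lambda x: sum(w * nu(x) for nu, w in zip(measures, weights))

def posterior(xi, y, x):
    """Predictive probability xi(y|x) = xi(xy) / xi(x)."""
    return xi(x + y) / xi(x)

measures = [bernoulli(th) for th in (0.1, 0.5, 0.9)]   # hypothetical class
weights = [0.25, 0.5, 0.25]                            # hypothetical weights
xi = mixture(measures, weights)

# Dominance xi(x) >= w_nu * nu(x) holds term by term for every string x.
for n in range(6):
    for bits in product("01", repeat=n):
        x = "".join(bits)
        assert all(xi(x) >= w * nu(x) * (1 - 1e-12)
                   for nu, w in zip(measures, weights))

# After many ones the mixture predicts like the best-fitting component.
print(posterior(xi, "1", "1" * 30))
```

Dominance needs nothing beyond non-negativity of the summands, which is why it carries over unchanged to any countable class.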

5 Universal (Semi)Measures

What is so special about the set of all enumerable semimeasures
$\mathcal{M}^{semi}_{enum}$? The larger we choose $\mathcal{M}$ the less
restrictive is the assumption that $\mathcal{M}$ should contain the true
distribution $\mu$, which will be essential throughout the paper. Why not
restrict to the still rather general class of estimable or finitely computable
(semi)measures? It is clear that for every countable set $\mathcal{M}$,

$\xi(x) := \xi_\mathcal{M}(x) := \sum_{\nu\in\mathcal{M}} w_\nu\nu(x)$ with $\sum_{\nu\in\mathcal{M}} w_\nu \leq 1$ and $w_\nu > 0$   (5)

dominates all $\nu\in\mathcal{M}$. This dominance is necessary for the desired
convergence $\xi\to\mu$ similarly to (2). The question is what properties $\xi$
possesses. The distinguishing property of $\mathcal{M}^{semi}_{enum}$ is that
$\xi$ is itself an element of $\mathcal{M}^{semi}_{enum}$. When concerned with
predictions, $\xi_\mathcal{M}\in\mathcal{M}$ is not by itself an important
property; what matters is whether $\xi$ is computable in one of the senses of
Definition 1. We define

$\mathcal{M}_1 \stackrel{\times}{\geq} \mathcal{M}_2$ :⇔ there is an element of $\mathcal{M}_1$ which dominates all elements of $\mathcal{M}_2$
:⇔ $\exists\rho\in\mathcal{M}_1\ \forall\nu\in\mathcal{M}_2\ \exists w_\nu>0\ \forall x : \rho(x)\geq w_\nu\nu(x)$.

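$\stackrel{\times}{\geq}$ is transitive (but not necessarily reflexive) in the
sense that
$\mathcal{M}_1 \stackrel{\times}{\geq} \mathcal{M}_2 \stackrel{\times}{\geq} \mathcal{M}_3$
implies $\mathcal{M}_1 \stackrel{\times}{\geq} \mathcal{M}_3$, and
$\mathcal{M}_0 \supseteq \mathcal{M}_1 \stackrel{\times}{\geq} \mathcal{M}_2 \supseteq \mathcal{M}_3$
implies $\mathcal{M}_0 \stackrel{\times}{\geq} \mathcal{M}_3$. For the
computability concepts introduced in Section 2 we have the following proper set
inclusions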



$\mathcal{M}^{msr}_{comp}  \subset \mathcal{M}^{msr}_{est}  \equiv  \mathcal{M}^{msr}_{enum}  \subset \mathcal{M}^{msr}_{appr}$
      $\cap$                       $\cap$                      $\cap$                       $\cap$
$\mathcal{M}^{semi}_{comp} \subset \mathcal{M}^{semi}_{est} \subset \mathcal{M}^{semi}_{enum} \subset \mathcal{M}^{semi}_{appr}$

where $\mathcal{M}^{msr}_c$ stands for the set of all probability measures of
appropriate computability type $c\in\{$comp = finitely computable,
est = estimable, enum = enumerable, appr = approximable$\}$, and similarly for
semimeasures $\mathcal{M}^{semi}_c$. From an enumeration of a measure $\rho$ one
can construct a co-enumeration by exploiting
$\rho(x_{1:n}) = 1 - \sum_{y_{1:n}\neq x_{1:n}} \rho(y_{1:n})$. This shows that
every enumerable measure is also co-enumerable, hence estimable, which proves
the identity $\equiv$ above.
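The co-enumeration argument can be made concrete for binary strings. In the sketch below, `lower_uniform` is a hypothetical lower approximation of the uniform measure $\lambda(x) = 2^{-\ell(x)}$, chosen only so that the example is checkable; the function names are assumptions introduced here.

```python
from fractions import Fraction
from itertools import product

def lower_uniform(x, t):
    """Hypothetical enumeration phi(x, t), increasing in t to 2^-len(x)."""
    return (1 - Fraction(1, 2 ** t)) * Fraction(1, 2 ** len(x))

def upper_from_lower(lower, x, t):
    """Upper bound on rho(x) built from lower bounds on all other strings
    of the same length, using rho(x) = 1 - sum_{y != x} rho(y)."""
    siblings = ("".join(bits) for bits in product("01", repeat=len(x)))
    return 1 - sum(lower(y, t) for y in siblings if y != x)

x = "010"
for t in (0, 1, 5, 20):
    print(t, float(lower_uniform(x, t)), float(upper_from_lower(lower_uniform, x, t)))
```

The construction relies on the total mass at each length being exactly 1, so it is not available for semimeasures, consistent with the proper inclusion $\mathcal{M}^{semi}_{est}\subset\mathcal{M}^{semi}_{enum}$.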

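With this notation, Theorem 3 implies
$\mathcal{M}^{semi}_{enum} \stackrel{\times}{\geq} \mathcal{M}^{semi}_{enum}$.
Transitivity allows us to conclude, for instance, that
$\mathcal{M}^{semi}_{appr} \stackrel{\times}{\geq} \mathcal{M}^{msr}_{comp}$,
i.e. that there is an approximable semimeasure which dominates all computable
measures.

The standard "diagonalization" way of proving
$\mathcal{M}_1 \stackrel{\times}{\not\geq} \mathcal{M}_2$ is to take an
arbitrary $\mu\in\mathcal{M}_1$ and "increase" it to $\rho$ such that
$\mu \stackrel{\times}{\not\geq} \rho$ and show that $\rho\in\mathcal{M}_2$.
There are $7\times 7$ combinations of (semi)measures $\mathcal{M}_1$ with
$\mathcal{M}_2$ for which
$\mathcal{M}_1 \stackrel{\times}{\geq} \mathcal{M}_2$ could be true or false.
There are four basic cases, explicated in the following theorem, from which the
other 49 combinations displayed in Table 5 follow by transitivity.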

Theorem 4 (Universal (semi)measures) A semimeasure $\rho$ is said to be
universal for $\mathcal{M}$ if it multiplicatively dominates all elements of
$\mathcal{M}$ in the sense
$\forall\nu\ \exists w_\nu>0 : \rho(x)\geq w_\nu\nu(x)\ \forall x$. The
following holds true:

o) $\exists\rho : \{\rho\} \stackrel{\times}{\geq} \mathcal{M}$: For every
countable set of (semi)measures $\mathcal{M}$, there is a (semi)measure which
dominates all elements of $\mathcal{M}$.

i) $\mathcal{M}^{semi}_{enum} \stackrel{\times}{\geq} \mathcal{M}^{semi}_{enum}$:
The class of enumerable semimeasures contains a universal element.

ii) $\mathcal{M}^{msr}_{appr} \stackrel{\times}{\geq} \mathcal{M}^{semi}_{enum}$:
There is an approximable measure which dominates all enumerable semimeasures.

iii) $\mathcal{M}^{semi}_{est} \stackrel{\times}{\not\geq} \mathcal{M}^{msr}_{comp}$:
There is no estimable semimeasure which dominates all computable measures.

iv) $\mathcal{M}^{semi}_{appr} \stackrel{\times}{\not\geq} \mathcal{M}^{msr}_{appr}$:
There is no approximable semimeasure which dominates all approximable measures.

Marcus Hutter, Technical Report IDSIA-05-03



Table 5 (Existence of universal (semi)measures) The entry in row r and column c
indicates whether there is an r-able (semi)measure $\rho$ for the set
$\mathcal{M}$ which contains all c-able (semi)measures, where
r,c $\in$ {comput, estimat, enumer, approxim}. Enumerable measures are
estimable. This is the reason why the enum. row and column in case of measures
is missing. The superscript indicates from which part of Theorem 4 the answer
follows: for the entries marked * directly, for the others using transitivity of
$\stackrel{\times}{\geq}$.

                     M:  semimeasure                          measure
  rho                    comp.    est.     enum.    appr.     comp.    est.     appr.
  semi-      comp.       no^iii   no^iii   no^iii   no^iv     no^iii   no^iii   no^iii
  measure    est.        no^iii   no^iii   no^iii   no^iv     no^iii*  no^iii   no^iii
             enum.       yes^i    yes^i    yes^i*   no^iv     yes^i    yes^i    no^iv
             appr.       yes^i    yes^i    yes^i    no^iv     yes^i    yes^i    no^iv*
  measure    comp.       no^iii   no^iii   no^iii   no^iv     no^iii   no^iii   no^iii
             est.        no^iii   no^iii   no^iii   no^iv     no^iii   no^iii   no^iii
             appr.       yes^ii   yes^ii   yes^ii*  no^iv     yes^ii   yes^ii   no^iv

If we ask for a universal (semi)measure which at least satisfies the weakest
form of computability, namely being approximable, we see that the largest
dominated set among the 7 sets defined above is the set of enumerable
semimeasures. This is the reason why $\mathcal{M}^{semi}_{enum}$ plays a special
role. On the other hand, $\mathcal{M}^{semi}_{enum}$ is not the largest set
dominated by an approximable semimeasure, and indeed no such largest set exists.
One may, hence, ask for "natural" larger sets $\mathcal{M}$. One such set,
namely the set of cumulatively enumerable semimeasures $\mathcal{M}_{CEM}$, has
recently been discovered by Schmidhuber [Sch02], for which even
$\xi_{CEM}\in\mathcal{M}_{CEM}$ holds.
Theorem 4 also holds for discrete (semi)measures $P$ defined as follows:

Definition 6 (Discrete (Semi)measures) $P(x)$ denotes the probability of
$x\in\mathbb{N}$. We call $P : \mathbb{N}\to[0,1]$ a discrete (semi)measure if
$\sum_{x\in\mathbb{N}} P(x) \leq 1$, with equality for a measure.

Theorem 4 (i) is Levin's major result [ZL70, Th.4.3.1 & Th.4.5.1], (ii) is due
to Solomonoff [Sol78], the proof of
$\mathcal{M}^{semi}_{comp} \stackrel{\times}{\not\geq} \mathcal{M}^{semi}_{comp}$
in [LV97, p249] contains minor errors and is not extensible to (iii), and the
proof in [LV97, p276] only applies to infinite alphabet and not to the
binary/finite case considered here. A complete proof of (o)-(iv) for discrete
and continuous (semi)measures is given elsewhere.



6 Posterior Convergence 

We have investigated in detail the computational properties of various mixture
distributions $\xi$. A mixture $\xi_\mathcal{M}$ multiplicatively dominates all
distributions in $\mathcal{M}$. We have mentioned that dominance implies
posterior convergence. In this section we present in more detail what dominance
implies and what it does not.

Convergence of $\xi(x_t|x_{<t})$ to $\mu(x_t|x_{<t})$ with $\mu$-probability 1
tells us that $\xi(x_t|x_{<t})$ is close to $\mu(x_t|x_{<t})$ for sufficiently
large $t$ and "most" sequences $x_{1:\infty}$. It says nothing about the speed
of convergence, nor whether convergence is true for any particular sequence (of
measure 0). Convergence in mean sum defined below is intended to capture the
rate of convergence; Martin-Löf randomness is used to capture convergence
properties for individual sequences.

Martin-Löf randomness is a very important concept of randomness of individual
sequences, which is closely related to Kolmogorov complexity and Solomonoff's
universal prior. Levin gave a characterization equivalent to Martin-Löf's
original definition [Lev73]:

Theorem 7 (Martin-Löf random sequences) A sequence $x_{1:\infty}$ is
$\mu$-Martin-Löf random ($\mu$.M.L.) iff there is a constant $c$ such that
$M(x_{1:n}) \leq c\cdot\mu(x_{1:n})$ for all $n$.

One can show that a $\mu$.M.L. random sequence $x_{1:\infty}$ passes all
thinkable effective randomness tests, e.g. the law of large numbers, the law of
the iterated logarithm, etc. In particular, the set of all $\mu$.M.L. random
sequences has $\mu$-measure 1. The following generalization is natural when
considering general Bayes-mixtures $\xi$ as in this work:

Definition 8 ($\mu/\xi$-random sequences) A sequence $x_{1:\infty}$ is called
$\mu/\xi$-random ($\mu.\xi$.r.) iff there is a constant $c$ such that
$\xi(x_{1:n}) \leq c\cdot\mu(x_{1:n})$ for all $n$.

Typically, $\xi$ is a mixture over some $\mathcal{M}$ as defined in (5), in
which case the reverse inequality $\xi(x) \stackrel{\times}{\geq} \mu(x)$ is
also true (for all $x$). For finite $\mathcal{M}$ or if $\xi\in\mathcal{M}$, the
definition of $\mu/\xi$-randomness depends only on $\mathcal{M}$, and not on the
specific weights used in $\xi$. For $\mathcal{M} = \mathcal{M}^{semi}_{enum}$,
$\mu/\xi$-randomness is just $\mu$.M.L. randomness. The larger $\mathcal{M}$,
the more patterns are recognized as non-random. Roughly speaking, those
regularities characterized by some $\nu\in\mathcal{M}$ are recognized by
$\mu/\xi$-randomness, i.e. for $\mathcal{M}\subset\mathcal{M}^{semi}_{enum}$
some $\mu/\xi$-random strings may not be M.L. random. Other randomness concepts,
e.g. those by Schnorr, Ko, van Lambalgen, Lutz, Kurtz, von Mises, Wald, and
Church (see [Wan96, Lam87, Sch71]), could possibly also be characterized in
terms of $\mu/\xi$-randomness for particular choices of $\mathcal{M}$.
A classical (non-random) real-valued sequence $a_t$ is defined to converge to
$a_*$, short $a_t\to a_*$, if
$\forall\varepsilon\ \exists t_0\ \forall t\geq t_0 : |a_t - a_*| < \varepsilon$.
We are interested in convergence properties of random sequences $z_t(\omega)$
for $t\to\infty$ (e.g.
$z_t(\omega) = \xi(\omega_t|\omega_{<t}) - \mu(\omega_t|\omega_{<t})$). We
denote $\mu$-expectations by $E$. The expected value of a function
$f : \mathcal{X}^t\to\mathbb{R}$, dependent on $x_{1:t}$, independent of
$x_{t+1:\infty}$, and possibly undefined on a set of $\mu$-measure 0, is
$E[f] = \sum'_{x_{1:t}\in\mathcal{X}^t} \mu(x_{1:t}) f(x_{1:t})$. The prime
denotes that the sum is restricted to $x_{1:t}$ with $\mu(x_{1:t})\neq 0$.
Similarly we use $P[..]$ to denote the $\mu$-probability of event $[..]$. We
define four convergence concepts for random sequences.



Definition 9 (Convergence of random sequences) Let
$z_1(\omega), z_2(\omega), \ldots$ be a sequence of real-valued random
variables. $z_t$ is said to converge for $t\to\infty$ to random variable
$z_*(\omega)$

i) with probability 1 (w.p.1) :⇔ $P[\{\omega : z_t\to z_*\}] = 1$,

ii) in mean sum (i.m.s.) :⇔ $\sum_{t=1}^\infty E[(z_t - z_*)^2] < \infty$,

iii) for every $\mu$-Martin-Löf random sequence ($\mu$.M.L.) :⇔
$\forall\omega : [\exists c\,\forall n : M(\omega_{1:n}) \leq c\,\mu(\omega_{1:n})]$
implies $z_t(\omega)\to z_*(\omega)$ for $t\to\infty$,

iv) for every $\mu/\xi$-random sequence ($\mu.\xi$.r.) :⇔
$\forall\omega : [\exists c\,\forall n : \xi(\omega_{1:n}) \leq c\,\mu(\omega_{1:n})]$
implies $z_t(\omega)\to z_*(\omega)$ for $t\to\infty$.



In statistics, (i) is the "default" characterization of convergence of random
sequences. Convergence i.m.s. (ii) is very strong: it provides a rate of
convergence in the sense that the expected number of times $t$ in which $z_t$
deviates more than $\varepsilon$ from $z_*$ is finite and bounded by
$\sum_{t=1}^\infty E[(z_t - z_*)^2]/\varepsilon^2$. Nothing can be said about
the times $t$ at which these deviations occur. If, additionally, $|z_t - z_*|$
were monotone decreasing, then $|z_t - z_*| = o(t^{-1/2})$ could be concluded.
(iii) uses Martin-Löf's notion of randomness of individual sequences to define
convergence M.L. Since this work deals with general Bayes-mixtures $\xi$, we
generalized in (iv) the definition of convergence M.L. based on $M$ to
convergence $\mu.\xi$.r. based on $\xi$ in a natural way. One can show that
convergence i.m.s. implies convergence w.p.1. Also convergence M.L. implies
convergence w.p.1.
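The bound on the expected number of $\varepsilon$-deviations stated for convergence i.m.s. follows from Markov's inequality. A short derivation, writing $N_\varepsilon$ (a symbol introduced here for illustration) for the number of times $t$ with $|z_t - z_*| > \varepsilon$:

```latex
\begin{align*}
E[N_\varepsilon]
  &= E\Big[\sum_{t=1}^\infty \mathbf{1}\{|z_t - z_*| > \varepsilon\}\Big]
   = \sum_{t=1}^\infty P\big[|z_t - z_*| > \varepsilon\big] \\
  &= \sum_{t=1}^\infty P\big[(z_t - z_*)^2 > \varepsilon^2\big]
   \;\le\; \sum_{t=1}^\infty \frac{E[(z_t - z_*)^2]}{\varepsilon^2}
   \;<\; \infty .
\end{align*}
```

In particular $N_\varepsilon$ is finite with probability 1 for every $\varepsilon > 0$, which is one way to see that convergence i.m.s. implies convergence w.p.1.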
Universality of $\xi$ implies the following posterior convergence results:

Theorem 10 (Convergence of $\xi$ to $\mu$) Let there be sequences
$x_1x_2\ldots$ over a finite alphabet $\mathcal{X}$ drawn with probability
$\mu(x_{1:n})\in\mathcal{M}$ for the first $n$ symbols, where $\mu$ is a
measure. The universal posterior probability $\xi(x_t|x_{<t})$ of the next
symbol $x_t$ given $x_{<t}$ is related to the true posterior probability
$\mu(x_t|x_{<t})$ in the following way:

$\sum_{t=1}^n E\Big[\Big(\sqrt{\frac{\xi(x_t|x_{<t})}{\mu(x_t|x_{<t})}} - 1\Big)^2\Big] \leq \sum_{t=1}^n E\Big[\sum_{x'_t}\Big(\sqrt{\xi(x'_t|x_{<t})} - \sqrt{\mu(x'_t|x_{<t})}\Big)^2\Big] \leq \ln w_\mu^{-1} < \infty$

where $w_\mu$ is the weight (5) of $\mu$ in $\xi$.
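For a small Bernoulli class the bound of Theorem 10 can be checked numerically, because the expectation over histories only depends on the number of ones seen so far. The sketch below is an illustration under assumptions made here (a three-element class, uniform weights, the name `hellinger_sum`); it evaluates the sum of expected squared Hellinger-type distances up to time $n$ and compares it with $\ln w_\mu^{-1}$.

```python
import math

def hellinger_sum(thetas, weights, true_index, n):
    """Sum over t = 1..n of
    E_mu[ sum_{x'} (sqrt(xi(x'|x_<t)) - sqrt(mu(x'|x_<t)))^2 ]
    for a mixture xi over Bernoulli(theta_i) and true mu = Bernoulli(theta)."""
    theta = thetas[true_index]
    total = 0.0
    for t in range(1, n + 1):
        m = t - 1                      # length of the history x_<t
        for k in range(m + 1):         # k = number of ones in the history
            # mu-probability that the history contains exactly k ones
            p_hist = math.comb(m, k) * theta ** k * (1 - theta) ** (m - k)
            # unnormalised posterior weight of each component
            post = [w * th ** k * (1 - th) ** (m - k)
                    for w, th in zip(weights, thetas)]
            xi_one = sum(p * th for p, th in zip(post, thetas)) / sum(post)
            h = ((math.sqrt(xi_one) - math.sqrt(theta)) ** 2
                 + (math.sqrt(1 - xi_one) - math.sqrt(1 - theta)) ** 2)
            total += p_hist * h
    return total

thetas = [0.2, 0.5, 0.7]               # hypothetical class
weights = [1 / 3, 1 / 3, 1 / 3]        # hypothetical prior weights
total = hellinger_sum(thetas, weights, true_index=1, n=200)
print(total, "<=", math.log(1 / weights[1]))
```

The partial sums increase with $n$ but stay below $\ln w_\mu^{-1}$, which is what forces the individual terms to vanish.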
Theorem 10 implies

$\sqrt{\xi(x'_t|x_{<t})} \to \sqrt{\mu(x'_t|x_{<t})}$ for any $x'_t$ and $\sqrt{\frac{\xi(x_t|x_{<t})}{\mu(x_t|x_{<t})}} \to 1$, both i.m.s. for $t\to\infty$.

The latter strengthens the result $\xi(x_t|x_{<t})/\mu(x_t|x_{<t}) \to 1$ w.p.1
derived by Gács in [LV97, Th.5.2.2] in that it also provides the "speed" of
convergence.

Note also the subtle difference between the two convergence results. For any
sequence $x'_{1:\infty}$ (possibly constant and not necessarily $\mu$-random),
$\mu(x'_t|x_{<t}) - \xi(x'_t|x_{<t})$ converges to zero w.p.1 (referring to
$x_{1:\infty}$), but no statement is possible for
$\xi(x'_t|x_{<t})/\mu(x'_t|x_{<t})$, since $\liminf\mu(x'_t|x_{<t})$ could be
zero. On the other hand, if we stay on the $\mu$-random sequence
($x'_{1:\infty} = x_{1:\infty}$), we have
$\xi(x_t|x_{<t})/\mu(x_t|x_{<t}) \to 1$ (whether $\inf\mu(x_t|x_{<t})$ tends to
zero or not does not matter). Indeed, it is easy to see that
$\xi(1|0_{<t})/\mu(1|0_{<t}) \propto t \to \infty$ diverges for
$\mathcal{M} = \{\mu,\nu\}$, $\mu(1|x_{<t}) := \frac{1}{2}t^{-3}$ and
$\nu(1|x_{<t}) := \frac{1}{2}t^{-2}$, although $0_{1:\infty}$ is $\mu$-random.
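The divergence of the off-sequence ratio in this two-element example can be observed numerically. The sketch below assumes equal prior weights (the text does not fix them) and tracks the prefix probabilities $\mu(0_{<t})$ and $\nu(0_{<t})$ along the all-zero sequence; the function name is an illustrative choice.

```python
def ratio_on_zero_sequence(t_max, w_mu=0.5, w_nu=0.5):
    """Return xi(1|0_<t) / mu(1|0_<t) for t = 1..t_max, where
    mu(1|x_<t) = t^-3 / 2, nu(1|x_<t) = t^-2 / 2 and xi = w_mu*mu + w_nu*nu."""
    mu_prefix = 1.0   # mu(0_<t): probability of t-1 zeros under mu
    nu_prefix = 1.0   # nu(0_<t)
    ratios = []
    for t in range(1, t_max + 1):
        mu_one = 0.5 * t ** -3
        nu_one = 0.5 * t ** -2
        xi_prefix = w_mu * mu_prefix + w_nu * nu_prefix
        xi_one = (w_mu * mu_prefix * mu_one + w_nu * nu_prefix * nu_one) / xi_prefix
        ratios.append(xi_one / mu_one)
        mu_prefix *= 1 - mu_one   # extend the history by one more zero
        nu_prefix *= 1 - nu_one
    return ratios

r = ratio_on_zero_sequence(1000)
for t in (10, 100, 1000):
    print(t, r[t - 1], r[t - 1] / t)
```

Both prefix probabilities converge to positive constants, so the posterior weight of $\nu$ never vanishes and the ratio grows linearly in $t$, while the difference $\xi(1|0_{<t}) - \mu(1|0_{<t})$ still tends to zero.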

7 Convergence in Martin-Löf Sense

An interesting open question is whether $\xi$ converges to $\mu$ (in difference
or ratio) individually for all Martin-Löf random sequences. Clearly, convergence
$\mu$.M.L. may at most fail for a set of sequences with $\mu$-measure zero. A
convergence M.L. result would be particularly interesting and natural for
Solomonoff's universal prior $M$, since M.L. randomness can be defined in terms
of $M$ (see Theorem 7). Attempts to convert the bounds in Theorem 10 to
effective $\mu$.M.L. randomness tests fail, since $M(x_t|x_{<t})$ is not
enumerable. The proof given of $M/\mu \stackrel{M.L.}{\longrightarrow} 1$ in
[LV97, Th.5.2.2] and [VL00, Th.10] is incomplete.$^1$ The implication
"$M(x_{1:n}) \leq c\cdot\mu(x_{1:n})\ \forall n \Rightarrow \lim_{n\to\infty} M(x_{1:n})/\mu(x_{1:n})$
exists" has been used, but not proven, and may indeed be wrong.

Vovk [Vov87] shows that for two finitely computable semi-measures $\mu$ and
$\rho$ and $x_{1:\infty}$ being $\mu$ and $\rho$ M.L. random that



If $M$ were recursive, then this would imply posterior $M\to\mu$ and
$M/\mu\to 1$ for every $\mu$.M.L. random sequence $x_{1:\infty}$, since every
sequence is $M$.M.L. random. Since $M$ is not recursive Vovk's theorem cannot be
applied and it is not obvious how to generalize it. So the question of
individual convergence remains open. More generally, one may ask whether
$\xi_\mathcal{M}\to\mu$ for every $\mu/\xi$-random sequence. It turns out that
this is true for some $\mathcal{M}$, but false for others.

$^1$ The formulation of their Theorem is quite misleading in general: "Let $\mu$
be a positive recursive measure. If the length of $y$ is fixed and the length of
$x$ grows to infinity, then $M(y|x)/\mu(y|x)\to 1$ with $\mu$-probability one.
The infinite sequences $\omega$ with prefixes $x$ satisfying the displayed
asymptotics are precisely ['$\Rightarrow$' and '$\Leftarrow$'] the
$\mu$-random sequences." First, for off-sequence $y$ convergence w.p.1 does not
hold ($xy$ must be demanded to be a prefix of $\omega$). Second, the proof of
'$\Leftarrow$' is loopy (see main text). Last, '$\Rightarrow$' is given without
proof and is probably wrong. Also the assertion in [LV97, Th.5.2.1] that
$S_t := E\sum_{x'_t}\big(\mu(x'_t|x_{<t}) - M(x'_t|x_{<t})\big)^2$ converges to
zero faster than $1/t$ cannot be made, since $S_t$ may not decrease
monotonically.

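Theorem 11 ($\mu/\xi$-convergence of $\xi$ to $\mu$) Let $\mathcal{X} = \{0,1\}$
be binary and
$\mathcal{M}_\Theta := \{\mu_\theta : \mu_\theta(1|x_{<t}) = \theta\ \forall t,\ \theta\in\Theta\}$
be the set of Bernoulli($\theta$) distributions with parameters
$\theta\in\Theta$. Let $\Theta_D$ be a countable dense subset of $[0,1]$, e.g.
$[0,1]\cap\mathbb{Q}$, and let $\Theta_G$ be a countable subset of $[0,1]$ with
a gap in the sense that there exist $0 < \theta_0 < \theta_1 < 1$ such that
$[\theta_0,\theta_1]\cap\Theta_G = \{\theta_0,\theta_1\}$, e.g.
$\Theta_G = \{\frac{1}{4},\frac{3}{4}\}$ or
$\Theta_G = ([0,\frac{1}{4}]\cup[\frac{3}{4},1])\cap\mathbb{Q}$. Then

i) If $x_{1:\infty}$ is $\mu/\xi_{\mathcal{M}_{\Theta_D}}$ random with
$\mu\in\mathcal{M}_{\Theta_D}$, then
$\xi_{\mathcal{M}_{\Theta_D}}(x_t|x_{<t}) \to \mu(x_t|x_{<t})$,

ii) There are $\mu\in\mathcal{M}_{\Theta_G}$ and
$\mu/\xi_{\mathcal{M}_{\Theta_G}}$ random $x_{1:\infty}$ for which
$\xi_{\mathcal{M}_{\Theta_G}}(x_t|x_{<t}) \not\to \mu(x_t|x_{<t})$.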

Our original/main motivation of studying $\mu/\xi$-randomness is the implication
of Theorem 11 that $M \stackrel{M.L.}{\longrightarrow} \mu$ cannot be decided
from $M$ being a mixture distribution or from the universality property
(Theorem 3) alone. Further structural properties of $\mathcal{M}^{semi}_{enum}$
have to be employed. For Bernoulli sequences, convergence
$\mu.\xi_{\mathcal{M}_\Theta}$.r. is related to denseness of
$\mathcal{M}_\Theta$. Maybe a denseness characterization of
$\mathcal{M}^{semi}_{enum}$ can solve the question of convergence M.L. of $M$.
The property $M\in\mathcal{M}^{semi}_{enum}$ is also not sufficient to resolve
this question, since there are $\mathcal{M}\ni\xi$ for which
$\xi \stackrel{\mu.\xi.r.}{\longrightarrow} \mu$ and $\mathcal{M}\ni\xi$ for
which $\xi \stackrel{\mu.\xi.r.}{\not\longrightarrow} \mu$. Theorem 11 can be
generalized to i.i.d. sequences over general finite alphabet $\mathcal{X}$.

The idea to prove (ii) is to construct a sequence $x_{1:\infty}$ which is
$\mu_{\theta_0}/\xi$-random and $\mu_{\theta_1}/\xi$-random for
$\theta_0\neq\theta_1$. This is possible if and only if $\Theta$ contains a gap
and $\theta_0$ and $\theta_1$ are the boundaries of the gap. Obviously $\xi$
cannot converge to both $\theta_0$ and $\theta_1$, thus proving non-convergence.
For no $\theta\in[0,1]$ will this $x_{1:\infty}$ be $\mu_\theta$ M.L.-random.
Finally, the proof of Theorem 11 makes essential use of the mixture
representation of $\xi$, as opposed to the proof of Theorem 10 which only needs
dominance $\xi \stackrel{\times}{\geq} \mathcal{M}$.
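The proof idea for (ii) can be illustrated with the simplest gap class $\Theta_G = \{\frac{1}{4},\frac{3}{4}\}$ and the alternating sequence $0101\ldots$. The sketch below, with equal weights and the name `run_mixture` as assumptions made here, tracks the ratios $\xi(x_{1:n})/\mu_\theta(x_{1:n})$ from Definition 8 and the predictions $\xi(1|x_{<t})$; it is an illustration of the mechanism, not the construction used in the actual proof.

```python
from fractions import Fraction

def run_mixture(thetas, weights, seq):
    """Return (preds, ratios): preds[t] = xi(1|x_<t) before seeing seq[t],
    ratios[n][i] = xi(x_1:n) / mu_theta_i(x_1:n) after n+1 symbols."""
    likes = [Fraction(1)] * len(thetas)      # mu_theta_i(x_1:n)
    preds, ratios = [], []
    for bit in seq:
        xi_prefix = sum(w * l for w, l in zip(weights, likes))
        xi_one = sum(w * l * th for w, l, th in zip(weights, likes, thetas))
        preds.append(xi_one / xi_prefix)
        likes = [l * (th if bit == 1 else 1 - th)
                 for l, th in zip(likes, thetas)]
        xi_now = sum(w * l for w, l in zip(weights, likes))
        ratios.append([xi_now / l for l in likes])
    return preds, ratios

thetas = [Fraction(1, 4), Fraction(3, 4)]    # boundaries of the gap
weights = [Fraction(1, 2), Fraction(1, 2)]   # hypothetical weights
preds, ratios = run_mixture(thetas, weights, [0, 1] * 100)

print(max(r[0] for r in ratios), max(r[1] for r in ratios))  # bounded ratios
print(sorted(set(preds)))                                    # prediction values
```

Both ratios stay bounded, so the sequence is $\mu_{\theta}/\xi$-random for both $\theta = \frac{1}{4}$ and $\theta = \frac{3}{4}$ in the sense of Definition 8, yet the predictions keep alternating between two values inside the gap and approach neither parameter.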

8 Conclusions 

For a hierarchy of four computability definitions, we completed the
classification of the existence of computable (semi)measures dominating all
computable (semi)measures. Dominance is an important property of a prior, since
it implies rapid convergence of the corresponding posterior with probability
one. A strengthening would be convergence for all Martin-Löf (M.L.) random
sequences. This seems natural, since M.L. randomness can be defined in terms of
Solomonoff's prior $M$, so there is a close connection. Contrary to what was
believed before, the question of posterior convergence $M/\mu\to 1$ for all M.L.
random sequences is still open. We introduced a new flexible notion of
$\mu/\xi$-randomness which contains Martin-Löf randomness as a special case.
Though this notion may have a wider range of application, the main purpose for
its introduction was to show that standard proof attempts of $M/\mu\to 1$ based
on dominance only must fail. This follows from the derived result that the
validity of $\xi/\mu\to 1$ for $\mu/\xi$-random sequences depends on the Bayes
mixture $\xi$.

References 

[Hut01] M. Hutter. Convergence and error bounds of universal prediction for
general alphabet. Proceedings of the 12th European Conference on Machine
Learning (ECML-2001), pages 239-250, 2001.

[Hut03] M. Hutter. Sequence prediction based on monotone complexity. Technical
Report IDSIA-09-03, 2003.

[Lam87] M. van Lambalgen. Random Sequences. PhD thesis, University of Amsterdam,
1987.

[Lev73] L. A. Levin. On the notion of a random sequence. Soviet Math. Dokl.,
14(5):1413-1416, 1973.

[LV97] M. Li and P. M. B. Vitányi. An introduction to Kolmogorov complexity and
its applications. Springer, 2nd edition, 1997.

[Sch71] C. P. Schnorr. Zufälligkeit und Wahrscheinlichkeit. Springer, Berlin,
1971.

[Sch02] J. Schmidhuber. Hierarchies of generalized Kolmogorov complexities and
nonenumerable universal measures computable in the limit. International Journal
of Foundations of Computer Science, 13(4):587-612, 2002.

[Sol64] R. J. Solomonoff. A formal theory of inductive inference: Part 1 and 2.
Inform. Control, 7:1-22, 224-254, 1964.

[Sol78] R. J. Solomonoff. Complexity-based induction systems: comparisons and
convergence theorems. IEEE Trans. Inform. Theory, IT-24:422-432, 1978.

[VL00] P. M. Vitányi and M. Li. Minimum description length induction,
Bayesianism, and Kolmogorov complexity. IEEE Trans. on Information Theory,
46(2):446-464, 2000.

[Vov87] V. G. Vovk. On a randomness criterion. DOKLADY: Russian Academy of
Sciences Doklady. Mathematics (formerly Soviet Mathematics-Doklady),
35(3):656-660, 1987.

[Wan96] Y. Wang. Randomness and Complexity. PhD thesis, 1996.

[ZL70] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the
development of the concepts of information and randomness by means of the theory
of algorithms. Russian Mathematical Surveys, 25(6):83-124, 1970.



